Using Masks, Suffix Array-Based Data Structures And Multidimensional Arrays To Compute Positional Ngram Statistics From Corpora
Abstract
This paper describes an implementation that computes positional ngram statistics (i.e., Frequency and Mutual Expectation) using masks, suffix array-based data structures and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has shown successful results for the extraction of discontinuous collocations from large corpora. However, it is computationally expensive. For instance, 4,299,742 positional ngrams (n=1..7) can be generated from a 100,000-word corpus with a seven-word window context. In comparison, only 700,000 ngrams would be computed for the classical ngram model. Considerable effort is therefore needed to process positional ngram statistics in reasonable time and space. Our solution has O(h(F) N log N) time complexity, where N is the corpus size and h(F) is a function of the window context.
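To make the idea of masks concrete, here is a minimal Python sketch that counts the frequency of continuous and discontinuous ngrams by brute force. It is not the suffix array-based implementation the abstract refers to, and it rests on several assumptions: every ngram is anchored at its leftmost word rather than at a pivot inside the window, gaps are written as `_`, and the names `masks` and `positional_ngram_counts` are hypothetical. Because of these simplifications, its counts are not expected to reproduce the figures quoted in the abstract.

```python
from collections import Counter
from itertools import product

GAP = "_"  # placeholder for a skipped word (notation assumed for this sketch)


def masks(span):
    """Return all 0/1 masks of length `span` that keep the first and last word."""
    if span == 1:
        return [(1,)]
    return [(1, *inner, 1) for inner in product((0, 1), repeat=span - 2)]


def positional_ngram_counts(tokens, window=7):
    """Count every masked word sequence that spans at most `window` words.

    Naive enumeration for illustration only: each corpus position is
    visited once per span length and once per mask.
    """
    counts = Counter()
    for span in range(1, window + 1):
        span_masks = masks(span)
        for start in range(len(tokens) - span + 1):
            chunk = tokens[start:start + span]
            for mask in span_masks:
                ngram = tuple(w if keep else GAP for w, keep in zip(chunk, mask))
                counts[ngram] += 1
    return counts


if __name__ == "__main__":
    corpus = "a b c a b c".split()
    freq = positional_ngram_counts(corpus, window=3)
    print(freq[("a", "_", "c")])  # frequency of the discontinuous ngram "a _ c"
```

In this sketch each corpus position yields up to 64 masked sequences for a seven-word window, which illustrates why positional ngrams are far more numerous than classical ngrams and why a more compact data structure is needed.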
Similar resources
The Virtual Corpus Approach to Deriving Ngram Statistics from Large Scale Corpora
This paper reports our implementation of the Virtual Corpus approach to deriving ngram statistics for ngrams of any length from large-scale corpora based on the suffix array data structure. In order to enable the VC to accommodate corpora with a vocabulary of different size, we first convert corpus tokens into integer codes. To accelerate the processing, we employ a bucket-radixsort for sorting...
Entropy-Compressed Indexes for Multidimensional Pattern Matching
In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of the Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...
Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer ngrams. Suffix arrays were first introduced to compute the frequency and location of a substring (ngram) in a sequence (corpus) of length N. To compute frequencies over all N(N+1)/2 substrings in a corpus, the substrings are grouped into a manageab...
An Efficient Language Model Using Double-Array Structures
Ngram language models tend to grow with the size of the corpus and consume considerable resources. In this paper, we propose an efficient method for implementing ngram models based on double-array structures. First, we propose a method for representing backwards suffix trees using double-array structures and demonstrate its efficiency. Next, we propose two optimization methods fo...
Parallel Suffix Arrays for Corpus Exploration
This paper describes how recently developed techniques for suffix array construction and compression can be expanded to bring a new data structure, called parallel suffix array, into existence, which is suitable as an in-memory representation of large annotated corpora, enabling complex queries and fast extractions of the context of matching substrings. It is also shown how parallel suffix arra...